Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features

Authors

  • Mindy K. Ross
  • Ko-Wei Lin
  • Karen Truong
  • Abhishek Kumar
  • Mike Conway
Abstract

The Database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP-associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. We conclude that text categorization is a useful complement to document retrieval techniques in dbGaP.
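As a rough illustration of the χ² feature selection step mentioned in the abstract, the sketch below scores word n-grams by how strongly their presence is associated with a binary class label. It is not the authors' implementation: the function names (`ngrams`, `chi2_scores`), the whitespace tokenization, and the six toy study descriptions with their labels are all assumptions made up for this example.

```python
from collections import Counter

def ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) present in text."""
    tokens = text.lower().split()
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

def chi2_scores(docs, labels, n_max=2):
    """Score each n-gram with the chi-square statistic of its 2x2 table
    (n-gram present/absent vs. positive/negative class)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()  # document frequencies per class
    for text, y in zip(docs, labels):
        (pos_df if y else neg_df).update(ngrams(text, n_max))
    scores = {}
    for feat in set(pos_df) | set(neg_df):
        a = pos_df[feat]   # positive docs containing the n-gram
        b = neg_df[feat]   # negative docs containing the n-gram
        c = n_pos - a      # positive docs without it
        d = n_neg - b      # negative docs without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[feat] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Hypothetical toy data: 1 = heart/lung/blood study, 0 = other.
toy_docs = [
    "heart failure cohort study",
    "lung function and asthma",
    "blood pressure heart study",
    "breast cancer genome study",
    "autism family genome study",
    "diabetes eye study",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_scores = chi2_scores(toy_docs, toy_labels)
top_features = sorted(toy_scores, key=toy_scores.get, reverse=True)[:5]
```

In this toy setting, n-grams that occur in only one class (such as "heart") score higher than n-grams spread across both classes (such as "study"). A classifier such as naive Bayes or a support vector machine would then be trained on the top-scoring features only.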


Similar resources

Text Studies Classification of Database of Genotypes and Phenotypes using K-Nearest Neighbor Algorithm

The database of genotypes and phenotypes (dbGaP) is a new database for storing and distributing data from genome-wide association studies. dbGaP was launched by the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH). Searching for relevant studies of particular interest accurately and completely is a challenging task because of the keyword-based search method of the dbGaP Entrez system...


NCBI’s Database of Genotypes and Phenotypes: dbGaP

The Database of Genotypes and Phenotypes (dbGap, http://www.ncbi.nlm.nih.gov/gap) is a National Institutes of Health-sponsored repository charged to archive, curate and distribute information produced by studies investigating the interaction of genotype and phenotype. Information in dbGaP is organized as a hierarchical structure and includes the accessioned objects, phenotypes (as variables and...



Determination of Alpha 1-Antitrypsin Phenotypes and Genotypes in Iranian Patients

Alpha 1-antitrypsin (AAT) or alpha 1-protease inhibitor (PI) is the principal inhibitor of proteolytic enzyme in serum. Its phenotypic variability has been reported to be associated with liver, lung diseases and rheumatoid arthritis in humans. There is much documentation about high risk phenotypes of PI in some regions of the world, however, there are no reliable reports on these phenotypes and...


From Genome to Gene: A Review of Genes and Genetic Variations Affecting the Development of Metabolic Syndrome

Background: The prevalence of non-communicable disorders such as metabolic syndrome (MetS) is high in developing countries. Metabolic syndrome is a disorder of energy utilization and storage, diagnosed by a co-occurrence of three out of five of the following medical conditions: abdominal (central) obesity, elevated blood pressure, elevated fasting plasma glucose, high serum triglycerides, and l...



Journal:

Volume 6, Issue

Pages -

Publication date: 2013